Collaborative artificial intelligence system for investigation of healthcare claims compliance

Healthcare fraud, waste and abuse are costly problems that have a huge impact on society. Traditional approaches to identify non-compliant claims rely on auditing strategies requiring trained professionals, or on machine learning methods requiring labelled data and possibly lacking interpretability. We present Clais, a collaborative artificial intelligence system for claims analysis. Clais automatically extracts human-interpretable rules from healthcare policy documents (0.72 F1-score), and it enables professionals to edit and validate the extracted rules through an intuitive user interface. Clais executes the rules on claim records to identify non-compliance: on this task Clais significantly outperforms two baseline machine learning models, and its median F1-score is 1.0 (IQR = 0.83 to 1.0) when executing the extracted rules, and 1.0 (IQR = 1.0 to 1.0) when executing the same rules after human curation. Professionals confirm through a user study the usefulness of Clais in making their workflow simpler and more effective.

www.nature.com/scientificreports/
professionals because they need to explicitly relate suspicious claims to some specific text in a policy document to substantiate their investigations.
We developed Clais to tackle the limitations of existing approaches based on manually defined rules or data-driven machine learning algorithms, and to implement a collaborative workflow where professionals and AI cooperate to identify claims that are not compliant with a healthcare policy. Clais (see Fig. 1) analyzes policy documents, identifies paragraphs of text that may contain compliance or non-compliance definitions of patients' benefits, and translates them into rules that are both human interpretable and machine executable. We build on our previous work [24][25][26] leveraging different natural language processing techniques and a rich domain ontology (co-created with domain experts), which captures repeatable templates for the translation of policy text into rules, even in the absence of labelled data. The rules, formalized as knowledge graphs, consist of a set of logical conditions having precise semantics (defined in the ontology). Clais displays the rules, together with the corresponding paragraphs of text from the policy document, in an intuitive user interface where human experts can easily modify and validate them; they can also interactively build new rules (using a library of conditions automatically derived from the ontology) corresponding to fragments of text that the system failed to identify. After validation, all rules are stored in a shared knowledge store that complements the policy documents with referenceable, human-interpretable, and formal representations of patients' benefits and eligibility requirements. Finally, Clais executes the validated rules directly on claims data to identify potentially non-compliant claims (referred to as "at-risk") and presents the results in a visual interface where professionals have access to both aggregated and detailed information on claims at-risk, including explanations of why a claim is non-compliant with a rule, and corresponding evidence from claims data. Such an evidence-based view helps investigators assess claims at-risk, identify providers that violate policies, and prioritize their work.
We evaluate Clais using two dental policies from two states in the US. Dental spending by US government programs increased by 25% from 2020 to 2021 [27], and policy makers are considering extending dental coverage in Medicare [28]: these facts make the dental domain interesting for FWA investigations, and our domain experts confirmed that the structure of rules extracted from dental policies is generalizable to other domains. We measure the performance of Clais on two tasks. In the first task we use Clais to extract rules from policies and report its performance with respect to a ground truth consisting of rule definitions manually created by domain experts. In the second task we use Clais to identify non-compliant claims by executing both the automatically extracted rules and the corresponding validated rules on a database of almost 800 million claims containing almost 45 million dental claims. In the second task we also compare Clais to two baseline data-driven models using a ground truth consisting of non-compliant claims labelled using algorithms manually developed by domain experts. Finally, we report the results of a user study conducted with professional FWA investigators confirming the usefulness of our system.

Results
We evaluated Clais along three dimensions: extraction of rules from policy documents, execution of rules on claim records to identify non-compliance, and perceived usefulness by professional FWA investigators.
For the first two tasks (extraction and execution of rules) we built a ground truth [24,25,29] with the help of domain experts who manually defined rules corresponding to policy documents for dental providers for the states of Iowa [30] and Colorado [31] (US); all domain experts who helped design and test Clais are professional FWA investigators. The resulting 141 ground truth rules are based on the same ontology that guides Clais's automatic extraction of rules from text, and are formalized into knowledge graphs; there are 90 rules from the Iowa policy document (a knowledge graph with 1977 vertices and 2447 edges), and 51 rules from the Colorado policy document (a knowledge graph with 1651 vertices and 2044 edges); the ontology and the ground truth rules are publicly available [32]. The ground truth rules include examples of all rule types that domain experts identified as useful to check claims compliance: 112 service limitation rules (which define constraints on the number of units or the monetary amount a provider can bill for a service per patient over a period of time), 25 mutually exclusive rules (which define disjoint pairs of services that cannot be reimbursed together for the same patient, usually within a time window), and 4 non-coverage rules (which explicitly list services that cannot be reimbursed under certain conditions). Additionally, the domain experts manually developed 20 ground truth algorithms (using their standard development tools and environment) corresponding to 20 ground truth rules. We executed the ground truth algorithms to label claims at-risk from a database D of 798,181,509 patient claims, 44,913,580 of which (5.63%) are dental claims. The records in the claims database were anonymized and unlabeled; they were extracted from the MarketScan database [33,34], which contains data from more than 40 contributing health plans and captures data from more than 250 million unique individuals. The 20 ground truth algorithms identified 421,321 positive samples in the database D (potentially non-compliant claims), which corresponds to 0.94% of the dental claims in the database. We used the 141 ground truth rules to measure the performance of Clais in extracting correct rules from policy documents, and we used the claim records labelled by the 20 ground truth algorithms to measure the performance of Clais in identifying non-compliant claims.
Figure 1. Overview of Clais, a human-AI collaborative system to investigate the compliance of healthcare claims. The system takes as input a policy document (a) and identifies fragments of text that potentially contain rules defining compliance (or non-compliance) of claims (see for example the sentence highlighted in yellow). The AI system uses a domain ontology (b) to guide the translation of text into a formal rule knowledge graph (d). The system visualizes the rule knowledge graph in a user interface (c) that displays both the original fragment of text from the policy document and the conditions of the rule in a human-understandable format. Human experts can edit the rule, and finally validate it using the user interface; they can also compose (using a library of conditions) new rules corresponding to fragments of policy text that the system failed to identify. After human validation, the AI system performs a normalization step (e) that produces an executable rule knowledge graph. If the original rule is a compliance rule, then the normalization transforms it into a non-compliance rule (by reversing compliance conditions). Finally, the normalization step produces a mapping from the rule conditions (as defined in the ontology) to the fields in claims records, thus making it possible to execute the rule knowledge graph on the claims data to identify claims at risk. The system displays the results of the rule execution in a user interface (f), where human experts can see aggregated statistics (for example distribution of the number of claims at risk per patient, distribution of patient age vs number of claims at risk, total number of claims at risk, and their estimated value, etc.). The user interface also shows the details of each claim at risk (in the context of the sequence of claims for the patient), including a human understandable explanation of which values in the claim record violate conditions in the rule.
We computed precision, recall and F1 (the harmonic mean of precision and recall) for the extraction task. Given the set $G$ of ground truth rules and the set $E$ of automatically extracted rules, we computed precision as $|G \cap^{*} E| / |E|$ (the fraction of the extracted rules that fully or partially match ground truth rules) and recall as $|G \cap^{*} E| / |G|$ (the fraction of ground truth rules that are fully or partially extracted). The intersection $\cap^{*}$ includes both fully and partially matched rules. An extracted rule $R_{x,E} \in E$ fully or partially matches a ground truth rule $R_{x,G} \in G$ if $R_{x,E}$ and $R_{x,G}$ originate from the same text fragment in the same document, and there is a non-empty intersection between the set of conditions (and corresponding values) of $R_{x,E}$ and the set of conditions (and corresponding values) of $R_{x,G}$. We extended the classical definitions of precision and recall in information retrieval [35] to include partially matched rules because domain experts confirmed their usefulness (it is often simpler to correct problems in a partially correct rule than to create an entirely new one).
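The extended precision and recall can be illustrated with a minimal sketch. It assumes one reading of the matching criterion (same text fragment and at least one shared condition-value pair); the dictionary layout, the fragment identifiers, and the toy rules are hypothetical and do not come from the published ground truth.

```python
def condition_value_pairs(rule):
    """Flatten a rule into a set of (condition, value) pairs."""
    return {(name, value)
            for name, values in rule["conditions"].items()
            for value in values}


def matches(extracted, truth):
    """Full or partial match: same text fragment and at least one shared
    (condition, value) pair. This is an assumed reading of the definition."""
    same_fragment = extracted["fragment"] == truth["fragment"]
    shared = condition_value_pairs(extracted) & condition_value_pairs(truth)
    return same_fragment and bool(shared)


def extraction_scores(ground_truth, extracted):
    """Precision, recall and F1 with partially matched rules counted as hits."""
    matched_extracted = sum(
        any(matches(e, g) for g in ground_truth) for e in extracted)
    matched_truth = sum(
        any(matches(e, g) for e in extracted) for g in ground_truth)
    precision = matched_extracted / len(extracted) if extracted else 0.0
    recall = matched_truth / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: fragment ids and values are illustrative only.
ground_truth = [
    {"fragment": "doc1:p12",
     "conditions": {"hasApplicableService": {"D0145"}, "hasMaxAge": {2}}},
    {"fragment": "doc1:p30",
     "conditions": {"hasApplicableService": {"complete dentures",
                                             "partial dentures"}}},
]
extracted = [
    # Partial match: right service, wrong age value.
    {"fragment": "doc1:p12",
     "conditions": {"hasApplicableService": {"D0145"}, "hasMaxAge": {3}}},
    # No ground truth rule originates from this fragment.
    {"fragment": "doc1:p99", "conditions": {"hasMinAge": {12}}},
]

print(extraction_scores(ground_truth, extracted))  # (0.5, 0.5, 0.5)
```

Counting the partially correct first rule as a hit reflects the motivation given above: such a rule is cheaper to repair than to rebuild.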
Additionally, we defined three similarity metrics (formal definitions in "Formal definition of rule and similarity metrics") to measure the similarity of an extracted rule $R_{x,E}$ to the corresponding ground truth rule $R_{x,G}$: structure similarity (which measures the similarity of the logical structure of the two rules as the Sørensen-Dice coefficient [36,37] of the set of conditions of $R_{x,E}$ and the set of conditions of $R_{x,G}$); condition similarity (which measures the similarity among the values of each condition in $R_{x,E}$ with respect to the values of the corresponding condition in $R_{x,G}$); and an overall rule similarity (the arithmetic mean of the structure similarity and the condition similarities for all the conditions in the two rules).
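As a rough illustration of the structure similarity and of how the overall rule similarity combines the scores, the sketch below computes the Sørensen-Dice coefficient on two sets of condition names. The per-condition similarities are passed in as plain numbers because their formal definition is not reproduced here; the function names and the example values are assumptions.

```python
def structure_similarity(conditions_a, conditions_b):
    """Sørensen-Dice coefficient of two sets of condition names."""
    a, b = set(conditions_a), set(conditions_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty rules
    return 2 * len(a & b) / (len(a) + len(b))


def rule_similarity(structure_sim, condition_sims):
    """Arithmetic mean of the structure similarity and the condition
    similarities (one score per condition appearing in either rule)."""
    scores = [structure_sim, *condition_sims]
    return sum(scores) / len(scores)


extracted_conditions = {"hasApplicableService", "hasMaxAge"}
truth_conditions = {"hasApplicableService", "hasMaxAge", "hasMinAge"}

s = structure_similarity(extracted_conditions, truth_conditions)
# Hypothetical condition similarities: service matches, the two age
# conditions do not.
overall = rule_similarity(s, [1.0, 0.0, 0.0])
print(s, overall)
```

With two shared conditions out of 2 + 3, the structure similarity is 4/5, and averaging it with the three condition scores gives 1.8/4.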
We used hyper-parameter optimization to tune the configurations of the various sub-components of our rule extraction system. Table 1 shows that using a configuration optimized for F1, Clais achieved an F1-score of 0.72 when evaluated using all ground truth rules. The median overall rule similarity was 0.75 (IQR = 0.58 to 0.95) when evaluated using all ground truth rules and a configuration optimized for rule similarity. Figure 2 shows the distributions of values of the rule similarity metrics and their empirical cumulative distribution functions in our evaluations. In our current implementation of Clais (where professionals curate extracted rules before execution on claims data), we used the configuration optimized for F1 because it enabled the system to extract 5% more rules. Although the structure of the rules extracted using this configuration was slightly less accurate (the rule similarity dropped from 0.75 to 0.63), we observed that professionals preferred to use our visual interface (Fig. 1c) to amend partially incorrect rules rather than adding completely new ones.
Table 1. Evaluation of automatic extraction of rules. The table reports metrics when evaluating Clais with the 90 ground truth rules from the Iowa policy document, the 51 ground truth rules from the Colorado policy document, and all 141 ground truth rules. We report evaluation metrics for two configurations of the extraction system: optimized for F1, and optimized for rule similarity; for each configuration we report precision, recall, F1, and the median and IQR of our three rule similarity metrics. Significant values are in bold.
Executing rules on claim records to identify non-compliance is a typical classification task: the positive class consists of non-compliant claims. We measured precision ($tp/(tp + fp)$), recall ($tp/(tp + fn)$), and F1 (the harmonic mean of precision and recall) for every rule implemented as a ground-truth algorithm ($tp$, $fp$, and $fn$ are respectively the number of true positive, false positive, and false negative outcomes). Clais automatically extracted 18 (90%) of the ground-truth rules corresponding to the algorithms manually developed by domain experts. When executing the 18 automatically extracted rules on the claims database to identify non-compliance, the median precision, recall and F1 of our system were 1.0 (IQR = 0.88 to 1.0), 0.99 (IQR = 0.99 to 1.0), and 1.0 (IQR = 0.83 to 1.0), respectively. We asked the domain experts to use our visual interface (see Fig. 1c) to validate the 18 rules that Clais automatically extracted: 11 rules (61%) did not need any corrections; the remaining seven rules had two types of problems: they either missed a condition, or had one or more incorrect values in an existing condition. Our system achieved a median F1 of 1.0 (IQR = 1.0 to 1.0) when executing the validated rules on the claims database D.
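The per-rule metrics can be sketched as follows, assuming that the claims flagged by a rule and the claims labelled by the corresponding ground-truth algorithm are both available as sets of claim identifiers. The identifiers and the per-rule F1 values in the example are hypothetical.

```python
from statistics import median


def classification_scores(flagged, labelled):
    """Precision, recall and F1 for one rule.

    flagged  -- ids of claims the rule marks as non-compliant
    labelled -- ids of claims the ground-truth algorithm marks as non-compliant
    """
    tp = len(flagged & labelled)   # flagged and truly non-compliant
    fp = len(flagged - labelled)   # flagged but compliant
    fn = len(labelled - flagged)   # non-compliant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall, written in terms of counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Hypothetical claim ids for a single rule.
print(classification_scores({1, 2, 3, 4}, {1, 2, 3, 5}))  # (0.75, 0.75, 0.75)

# Summarizing several rules by the median, as done in the text.
f1_per_rule = [1.0, 0.75, 1.0]  # illustrative values only
print(median(f1_per_rule))  # 1.0
```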
To compare the results of our system with more traditional approaches found in the literature, we developed two baseline models: a gradient boosting model and a deep neural network. We split the labelled data for each ground-truth rule into training, validation, and test sets (54%, 6%, and 40% of the data, respectively), and we evaluated Clais and the two baseline models on the same test set. For a fair comparison, we used the automatically extracted rules (instead of rules curated by human experts) when evaluating Clais. Figure 3a compares precision, recall and F1 for Clais and the baseline models. Clais significantly outperformed both the gradient boosting model and the deep neural network: a one-tailed Mann-Whitney test indicates that precision, recall, and F1 are greater for Clais than for either of the two baseline models.
A distinctive characteristic of our system is its ability to explain the results obtained by executing the rules on claims data (screenshots in Figs. 1f and 3b): explanations are human understandable and include reference to the conditions of the rule extracted from the policy document. The user can choose a fragment of a policy document corresponding to one of the executed rules; the user interface displays aggregated statistics about the claims that the selected rule identified as non-compliant, and a human-friendly summary of the conditions of the rule. Additionally, the user may choose a specific non-compliant claim and see detailed information including data of related claims in the patient history, and a sentence explaining why it is not compliant, for example (Fig. 3b) "D4355 and D1110 are mutually exclusive within 1 day(s)" (D4355 and other similar codes used in claims refer to the procedure codes listed in the Code on Dental Procedures and Nomenclature [38] defined by the American Dental Association).
Figure 2. Distributions of values (box plots) and empirical cumulative distribution functions of the rule similarity metrics. Subfigure (a) shows data from the evaluation using the ground truth rules from the Iowa policy document; rows show results for the three rule similarity metrics (structure, conditions, and overall rule similarity) with two configurations of our system (optimized for F1 and optimized for rule similarity). Subfigures (b,c) show data from the evaluation using the ground truth rules from the Colorado policy document (b), and using all ground truth rules (c). We observe that optimizing for rule similarity does not always have significant effects on improving the rule similarity metrics. In the evaluation with the ground truth rules from the Iowa policy document (a), the structure similarity and the overall rule similarity obtained when optimizing for rule similarity are significantly greater than the corresponding metrics obtained when optimizing for F1 (the p-value of one-tailed Mann-Whitney test is 0.01824 for the structure similarity and 0.04877 for the overall similarity). In the evaluation with the ground truth rules from the Colorado policy document (b) the optimization for rule similarity has no statistically significant effects. Considering all ground truth rules (c), the optimization for rule similarity shows statistically significant effects only for the structure similarity metrics (the p-value of one-tailed Mann-Whitney test is 0.03958).
Furthermore, we investigated the collaborative aspect of Clais, and specifically its perceived usefulness, with a user study involving 15 participants. All participants were expert policy investigators in the healthcare domain. We divided the participants into two groups: Internal and External. The Internal Group included seven participants who collaborated with us (explaining the domain, the challenges, and the requirements; developing the ground-truth rules, and the 20 ground-truth algorithms). The participants in the Internal Group had the opportunity to use Clais before the user study, for example to define rules, to validate extracted rules, and to analyze potentially non-compliant claims identified by our system. Conversely, the eight participants in the External Group had no exposure to Clais before the user study, except for a one-hour introductory tutorial that we delivered immediately before the user study.
The job role of the participants is either 'FWA investigator' or 'Data Analyst'. Experts from the two job roles typically work together when investigating providers' claims. FWA investigators often start their work by analyzing a policy document; data analysts typically work more with claims data, either querying them, or writing algorithms to analyze them. The two job roles are almost equally represented in the overall group of participants: seven Data Analysts (2 in the Internal Group and 5 in the External Group) and eight FWA investigators (5 in the Internal Group and 3 in the External Group).
We investigated the perceived usefulness of our system, which we measured using two well-established standard questionnaires: the PUEU questionnaire [39], and the USE questionnaire [40]. Both the PUEU and the USE questionnaires use a 7-point Likert scale [41], where 1 = extremely unlikely, 2 = quite unlikely, 3 = slightly unlikely, 4 = neither unlikely nor likely, 5 = slightly likely, 6 = quite likely, and 7 = extremely likely. The PUEU questionnaire provides a measurement scale for the two variables "perceived usefulness" (PU) (which is defined as "the degree to which a person believes that using a particular system would enhance his or her job performance"), and "perceived ease of use" (EU) (which is defined as "the degree to which a person believes that using a particular system would be free of effort"). The PUEU questionnaire contains 12 questions, 6 for each of the two variables. The USE questionnaire (USE stands for usability, satisfaction, and ease of use) has four sections, measuring usefulness (8 questions), ease of use (11 questions), ease of learning (4 questions) and satisfaction (7 questions). We asked participants of the user study to answer all questions of the USE questionnaire; however, we observe that while the first two sections (usefulness and ease of use) are directly comparable with the respective sections of the PUEU questionnaire, the last two sections (ease of learning and satisfaction) are not comparable with PUEU, and are of less interest for our study, which investigated the perceived usefulness of Clais.
All participants answered all questions of the PUEU questionnaire, and all questions of the usefulness section of the USE questionnaire; we had 4.22% missing answers in the other sections of the USE questionnaire. We decided to assign the neutral score 4 of the 7-point Likert scale to the missing answers, because its meaning ("neither unlikely nor likely") is semantically consistent with the respondent not providing any answer to the question.
Table 2 reports the percentage of participants (also aggregated by group and job role) who answered positively (Likert score 5, 6, or 7) to all the questions related to usefulness: questions 1 to 6 of the PUEU questionnaire (PUEU [1:6]) and questions 1 to 8 of the USE questionnaire (USE [1:8]). No participant answered negatively (Likert score 1, 2, or 3) or neutrally (Likert score 4) to all questions related to usefulness. Figure 4 shows the frequency of answers to the Perceived Usefulness section of the PUEU questionnaire and the Usefulness section of the USE questionnaire. Answers are concentrated in the positive range of the Likert scale (scores 5, 6, and 7).
We analyzed in more detail the answers addressing the usefulness of the system in both questionnaires (PUEU [1:6] and USE [1:8]): we wanted to investigate if the tendency to answer positively to such questions was the same for various subgroups of users who participated in the study, and if positive answers were more likely than negative answers. The categorical variable "answer positively to a questionnaire section" is based on the following assumptions: (a) an answer to a question was positive if its Likert score is greater than 4 (we consider neutral answers as non-positive), (b) a participant answered positively to a questionnaire section if she/he answered positively to more than half of the questions in the section. Given the small sample size, we used a two-tailed Fisher's exact test with the following pairs of subgroups: (1) "Internal Group and External Group", (2) "Data Analysts and FWA Investigators", (3) "Data Analysts in Internal Group and FWA Investigators in Internal Group", (4) "Data Analysts in External Group and FWA Investigators in External Group", (5) "Data Analysts in [...] section is the same for any pair of subgroups. For all 12 cases (the 2 questionnaire sections PUEU [1:6] and USE [1:8], and the 6 pairs of subgroups), we fail to reject the null hypothesis, and therefore we cannot find (given the data collected in our user study) a statistically significant difference in the tendency to answer positively to PUEU [1:6] or USE [1:8] for any pair of subgroups.
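The sketch below illustrates assumptions (a) and (b) and a two-tailed Fisher's exact test on a 2x2 table, using only the standard library. The example table for the pair "Data Analysts and FWA Investigators" on PUEU [1:6] is inferred from the proportions reported in this section (14 of 15 participants positive overall, 8 of 8 FWA Investigators positive), not taken from published raw data; the function names are hypothetical.

```python
from math import comb


def answered_positively(scores):
    """Assumptions (a) and (b): a section is answered positively when more
    than half of its answers have a Likert score above 4. Missing answers
    (None) are assigned the neutral score 4."""
    filled = [4 if s is None else s for s in scores]
    positive = sum(s > 4 for s in filled)
    return positive > len(filled) / 2


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-7))


print(answered_positively([7, 6, 5, 5, 3, None]))  # True: 4 of 6 positive

# Rows: Data Analysts, FWA Investigators; columns: positive, not positive.
# Counts inferred from the reported proportions (an assumption).
p_value = fisher_exact_two_tailed(6, 1, 8, 0)
print(round(p_value, 3))  # 0.467
```

A p-value of this size is consistent with the failure to reject the null hypothesis reported above for this pair of subgroups.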
To investigate if positive answers were more likely than negative answers, we made the same assumptions (a) and (b), and additionally, we considered that participants answered independently and without influencing each other (because of how the user study was conducted). Under these assumptions, we used a one-sided binomial test, and our null hypothesis was that no more than 50% of the population answered positively to PUEU [1:6] or USE [1:8]. The data collected in our user study allowed us to reject the null hypothesis when considering all participants, or those having a job role of FWA Investigators (both in the Internal and External Group). More precisely, for PUEU [1:6] a positive answer is significantly more likely than a negative answer when looking at data collected from all professionals (empirical proportion = 0.93, p-value = 0.00049, 95% confidence interval = [0.72, 1.00]), or from FWA Investigators (empirical proportion = 1.0, p-value = 0.0039, 95% confidence interval = [0.69, 1.00]). For USE [1:8] a positive answer is significantly more likely than a negative answer when looking at data collected from all professionals (empirical proportion = 0.8, p-value = 0.0176, 95% confidence interval = [0.56, 1.00]), or from FWA Investigators (empirical proportion = 1.0, p-value = 0.0039, 95% confidence interval = [0.69, 1.00]).
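The reported p-values can be reproduced with an exact one-sided binomial test. The sketch below assumes that the empirical proportions correspond to 14 of 15 and 12 of 15 participants and to 8 of 8 FWA Investigators; these counts are inferred from the proportions and group sizes given in the text.

```python
from math import comb


def binomial_p_value_greater(successes, n, p0=0.5):
    """One-sided exact binomial test: probability of observing at least
    `successes` positive answers out of `n` when the true rate is `p0`."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(successes, n + 1))


# PUEU [1:6], all professionals: 14 of 15 positive (0.93).
print(round(binomial_p_value_greater(14, 15), 5))  # 0.00049
# USE [1:8], all professionals: 12 of 15 positive (0.8).
print(round(binomial_p_value_greater(12, 15), 4))  # 0.0176
# FWA Investigators, either questionnaire: 8 of 8 positive (1.0).
print(round(binomial_p_value_greater(8, 8), 4))    # 0.0039
```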

Discussion
In this study, we present Clais, a collaborative AI system for claims analysis, which supports the workflow of professionals in healthcare fraud, waste, and abuse, and helps them identify non-compliance in providers' claims. Clais overcomes limitations of existing systems that use manually defined compliance rules (costly and difficult to maintain) or data-driven machine learning algorithms (requiring large amounts of high-quality labelled data, and possibly lacking interpretability). Clais automatically extracts rules from healthcare policy documents. The rules are human-interpretable: professionals can interact with them in a visual interface enabling modification and validation of the rules. Clais executes the (validated) rules directly on claim records, identifies non-compliant claims, and reports their data, the violated rule(s), the corresponding fragment(s) of policy text, and a human understandable explanation of why a specific claim is not compliant. Clais's ability to provide useful and human understandable explanations of its results (confirmed in our user study) is a step forward in the direction of trustworthy artificial intelligence, and specifically in the possibility of achieving trust through counterfactual explanations [42].
The automatic extraction of executable rules from policy text is a complex task. To the best of our knowledge, there is no prior work in the healthcare domain addressing this task, and there is no available ground-truth data for building or evaluating systems. We contribute as open source a set of ground truth rules and a related ontology [32]. The findings of this study suggest that Clais is effective at automatically extracting (partially) correct rules from policy documents, and complements recent work in the area of 'Rules As Code' [4][5][6][7] showing that an AI system can collaborate with humans to create machine-consumable and human-understandable rules to accompany existing natural language policy documents.
This study and our system have some limitations. Firstly, we tested Clais on policy documents in the dental domain: extension to other domains requires adapting the ontology that guides the extraction of rules from text, and possibly fine tuning the extraction pipeline to recognize domain-specific entities from text (other components of Clais are domain-independent). Secondly, the ground truth rules that we developed may not be an exhaustive set, and future work includes developing a more extensive one. Nevertheless, we observe that our ground truth rules are very diverse: Fig. 5 compares four similarity metrics for all ground truth rules and for the 20 rules corresponding to the ground truth algorithms. Considering all ground truth rules, the median text similarity is 0.64 (IQR = 0.61 to 0.67), the median structure similarity is 0.2 (IQR = 0.13 to 0.4), the median for the average condition similarity is 0 (IQR = 0 to 0.04), and the median rule similarity is 0.03 (IQR = 0.01 to 0.14). Ground truth rules are quite similar from a textual point of view (the text similarity is the angular similarity between embedding vectors encoding the texts); the bivariate histogram in Fig. 5e shows that most rules have text similarity between 0.6 and 0.7 but a rule similarity very close to 0. The relatively high text similarity is not surprising: text fragments describe compliance and non-compliance regulations in the dental domain, and therefore the variations in terminology and sentence structure are small. Additionally, two rules (typically a service limitation rule and a rule describing services that are mutually exclusive) may often refer to the same text fragment. Figure 5k shows a typical example of two ground truth rules having high text similarity (0.75) but low rule similarity (0.35): the structure similarity is high (both rules define service limitations and have similar conditions), but the values of the conditions (even common ones) are very different, and therefore the average condition similarity and the overall rule similarity are low. Although sentences describing our ground truth rules are relatively similar, we observe that the metrics measuring their logical similarity (structure, conditions, and overall rule similarity) are very low: considering all distinct pairs of ground truth rules having a text similarity greater than the median value (0.64), only 4% have a rule similarity greater than 0.5, and only 1% have a rule similarity greater than 0.7: this shows a considerable diversity of the logical fabric of our ground truth rules. The rules corresponding to the ground truth algorithms are also largely diverse: the median text similarity is 0.67 (IQR = 0.65 to 0.72), the median structure similarity is 0.33 (IQR = 0.17 to 0.75), the median for the average condition similarity is 0.10 (IQR = 0.00 to 0.21), and the median rule similarity is 0.17 (IQR = 0.02 to 0.29); considering all distinct pairs of ground truth algorithms having a text similarity greater than the median value (0.67), only 14% have a rule similarity greater than 0.5, and only 6% have a rule similarity greater than 0.7. The diversity of our ground truth helps to support our findings related to the performance of Clais in automatically extracting rules from text and in executing the rules to identify non-compliant claims. Further studies may extend our ground truth to evaluate performance in other medical domains.
Previous work [15,17,21] has used machine learning and deep learning to identify fraudulent claims and providers in public datasets. However, existing methods do not consider policy documents, but only historical claims that have been labelled as fraudulent. Our approach is different. Firstly, it does not require training on historical labelled data, but it can identify potentially non-compliant claims even without access to previous (similar) samples. Secondly, it promotes standardization and re-usability by formalizing policy text into structured rules that are both human-interpretable and machine-executable. We cannot directly compare our approach with previous work [15,17,21], because their datasets do not include the policy documents that Clais requires to extract rules and do not provide compliance information at the granularity of a single claim (which is what we aim to detect with our system) but only at provider level. For these reasons we trained two baseline models (a gradient boosting model and a deep neural network) on the same dataset of labelled claims that we use to execute the rules extracted by Clais. Our findings show that our system significantly outperforms both baseline models (see Fig. 3). There are two cases (see rules $R_{12}$ and $R_{14}$ in Fig. 3) in which the gradient boosting model exhibits a better F1-score than Clais. Rule $R_{12}$ corresponds to the following policy text: "an oral evaluation for children under three years of age and counseling with the primary caregiver (D0145) is payable once every six months". Our system misinterprets the expression "under three years" and assigns the value "3" to the rule condition "hasMaxAge", thus including patients of three years of age: this error has a negative impact on precision when executing the rule. Rule $R_{14}$ also has a problem with a value that is not correctly extracted from the following text: "complete and partial dentures are payable once in a five-year period". The text describes a service limitation on "complete dentures" and "partial dentures"; the automatically extracted rule contains the condition "hasApplicableService" to express the service limitation, but its value includes only "partial dentures". This error has a negative impact on recall when executing the rule. For both rules $R_{12}$ and $R_{14}$, Clais uses correct conditions but fails to extract correct values; errors of this kind are easy for human experts to identify and correct when validating the rules using the Clais user interface.
In general, we observe that gradient boosting models have poor performance because most rules need some aggregation of historical claims data to verify compliance. Such aggregations require manual feature engineering for each specific rule, with an implementation effort comparable to direct implementation of rule algorithms by policy experts. The deep neural network, which we implemented as a recurrent neural network [44][45][46] to capture the temporal nature of historical claims data, also has very poor performance. Besides the well-known problems of this type of network (vanishing or exploding gradient 47), we observe that rules defined in policy documents usually have different temporal aggregations (ranging from months to years). We speculate that such variety of temporal aggregations makes training of the neural network difficult and unstable, thus causing generalization problems that prevent the network from learning useful patterns.
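To make the notion of aggregating historical claims concrete, the sketch below checks the service limitation quoted for R14 ("payable once in a five-year period") against a patient's claim history. The record layout, the function name `at_risk`, the 365-day approximation of a year, and the choice to measure the window from the last compliant claim are all assumptions made for illustration, not a description of how Clais executes rules.

```python
from datetime import date

APPLICABLE = {"complete dentures", "partial dentures"}
WINDOW_DAYS = 5 * 365  # assumed approximation of a five-year period

def at_risk(records):
    """Flag applicable claims that follow a compliant one within the window."""
    flagged = []
    last_compliant = {}  # patient id -> date of the last compliant applicable claim
    for patient, service, day in sorted(records, key=lambda r: r[2]):
        if service not in APPLICABLE:
            continue  # the limitation does not apply to this service
        previous = last_compliant.get(patient)
        if previous is not None and (day - previous).days < WINDOW_DAYS:
            flagged.append((patient, service, day))
        else:
            last_compliant[patient] = day
    return flagged

# Hypothetical claim history: (patient id, service, date of service)
history = [
    ("p1", "complete dentures", date(2015, 1, 10)),
    ("p1", "partial dentures", date(2017, 6, 1)),
    ("p1", "partial dentures", date(2021, 3, 1)),
    ("p2", "partial dentures", date(2016, 1, 1)),
    ("p2", "oral evaluation", date(2016, 2, 1)),
]
result = at_risk(history)
print(result)
```

A model trained on individual claim rows would need a hand-built feature such as "days since the previous applicable service" to capture the same check, which is the per-rule feature engineering effort discussed above.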
Finally, the findings in this study confirm that professionals perceive Clais as useful to support their work. We observe that the tendency to answer positively to the usefulness sections of the PUEU and USE questionnaires is predominant for the job role FWA Investigators. This suggests that the design and development of Clais has been influenced by professionals having this job role (71% of the experts in the Internal Group are FWA Investigators). In accordance with the previous observation, we note that Data Analysts in the External Group provided comments to their answers in the user study suggesting additional features related to the exploration of claims data (for example, they asked to see the queries run by our system to identify claims at risk for a given rule, or the mapping between the conditions of a rule and the fields of the claims database). Such comments are very useful to plan future development of Clais.

Method
Formal definition of rule and similarity metrics
A rule is a logical representation of a section of text in a policy document. We formally define a rule $R = \bigwedge_{i=1}^{n} C_i$ as the conjunction of the set of conditions $C(R) = \{C_1, C_2, \ldots, C_n\}$, where each condition $C_i$ is defined by a property in our ontology, for example "hasMinAge" or "hasExcludedService". The ontology also specifies restrictions on properties such as cardinality or expected data types. We refer to the set of values of condition $C_i$ in rule $R$ as $V(C_i, R) = \{v_1, v_2, \ldots, v_m\}$; multiple values are interpreted as a disjunction. If $V(C_i, R)$ contains only one numeric value (for example "hasMinAge(12)"), then we refer to $C_i$ as a numeric condition.
Given two rules $R_x$ and $R_y$, and a condition $C_i$, we define the condition similarity metric $S_C(C_i, R_x, R_y)$ by cases. If $C_i$ is a numeric condition and is present in both $R_x$ and $R_y$, then $S_C(C_i, R_x, R_y)$ is inversely proportional to the distance between $v_{i,x}$ (the numeric value of $C_i$ in $R_x$) and $v_{i,y}$ (the numeric value of $C_i$ in $R_y$). If instead $C_i$ is present in both $R_x$ and $R_y$ but is not numeric, then $S_C(C_i, R_x, R_y)$ is the Jaccard similarity 48 between the set of values of $C_i$ in $R_x$ and the set of values of $C_i$ in $R_y$, that is $|V(C_i, R_x) \cap V(C_i, R_y)| / |V(C_i, R_x) \cup V(C_i, R_y)|$. Finally, when $C_i$ is missing in either $R_x$ or $R_y$, the condition similarity $S_C(C_i, R_x, R_y)$ is equal to 0.
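The three cases of the condition similarity can be sketched as follows. A rule is represented here as a dictionary mapping condition names to sets of values, which is a simplification chosen for illustration. The numeric case uses $1 / (1 + |v_{i,x} - v_{i,y}|)$, which is only one possible function that decreases with the distance; it is an assumption, since the exact expression is not reproduced in this section.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def is_numeric(values):
    """A condition is numeric when it holds exactly one numeric value."""
    return len(values) == 1 and all(isinstance(v, (int, float)) for v in values)

def condition_similarity(name, rule_x, rule_y):
    """Similarity of condition `name` between two rules (dict: name -> set of values)."""
    if name not in rule_x or name not in rule_y:
        return 0.0  # condition missing in either rule
    vx, vy = rule_x[name], rule_y[name]
    if is_numeric(vx) and is_numeric(vy):
        (a,), (b,) = vx, vy
        return 1.0 / (1.0 + abs(a - b))  # ASSUMED form, decreasing with distance
    return jaccard(vx, vy)  # non-numeric condition

# Hypothetical extracted rule and reference rule
rule_x = {"hasMaxAge": {3}, "hasApplicableService": {"partial dentures"}}
rule_y = {
    "hasMaxAge": {2},
    "hasApplicableService": {"partial dentures", "complete dentures"},
    "hasExcludedService": {"D0145"},
}
for name in ("hasMaxAge", "hasApplicableService", "hasExcludedService"):
    print(name, condition_similarity(name, rule_x, rule_y))
```

With the assumed numeric form, identical values give a similarity of 1 and the score decays smoothly as the values diverge, so a small extraction error in a value is penalized less than a missing condition.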
Given two rules $R_x$ and $R_y$, we also define the rule structure similarity as the Jaccard similarity between the set of conditions in $R_x$ and the set of conditions in $R_y$: $\frac{|C(R_x) \cap C(R_y)|}{|C(R_x) \cup C(R_y)|}$.
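As a small worked example of the structure similarity, the sketch below compares only the condition names of two rules and ignores their values. The dictionary representation of a rule and the example conditions are assumptions made for illustration.

```python
def structure_similarity(rule_x, rule_y):
    """Jaccard similarity between the sets of condition names of two rules."""
    cx, cy = set(rule_x), set(rule_y)
    union = cx | cy
    return len(cx & cy) / len(union) if union else 0.0

# Hypothetical rules: condition name -> set of values (values are ignored here)
extracted = {"hasMaxAge": {3}, "hasApplicableService": {"D0145"}}
reference = {
    "hasMaxAge": {2},
    "hasApplicableService": {"D0145"},
    "hasExcludedService": {"D0120"},
}
score = structure_similarity(extracted, reference)
print(round(score, 3))
```

Here the two rules share two of the three distinct conditions, so the structure similarity is 2/3 even though the value of "hasMaxAge" differs; value differences are captured separately by the condition similarity.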

Extraction of rules from text
Clais uses knowledge graphs to represent rules. A knowledge graph is a directed labelled graph (example in Fig. 1d), and we encode its structure and semantics using RDF 52 triples (subject, predicate, object). Our ontology 32 (excerpt in Fig. 1b), designed in collaboration with expert policy investigators 25, formally defines the meaning of subjects, predicates and objects used in the rule knowledge graphs. The ontology also specifies restrictions on the predicates (for example, expected domain and range, disjoint or cardinality constraints), which guide our system in building semantically valid RDF triples and meaningful rules. Additionally, the ontology defines rule types, which consist of concept-relationship templates capturing repeatable linguistic patterns in policy documents. The current version of the ontology specifies three rule types: (1) limitations on services, such as units of service or reimbursable monetary amounts that a provider can report for a single beneficiary over a given period; (2) mutually exclusive procedures that cannot be billed together for the same patient over a period; and (3) services not covered by a policy under certain conditions. The ontology design ensures that every rule knowledge graph can be modelled as a tree (an undirected graph in which any two vertices are connected by exactly one path), where leaves are values of the conditions in the rule. The tree representation enables Clais to visualize the rule's conditions and their values in an intuitive user interface, which simplifies editing and validation of the rule. The same user interface also supports the interactive creation of new rules: professionals compose a rule by selecting items from a library of conditions based on the properties defined in the ontology; the system asks for condition values and checks their validity in accordance with the restrictions defined for the corresponding ontology predicate (for example domain, range, cardinality).
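The validity check performed during interactive rule creation can be sketched as follows. The `ONTOLOGY` dictionary, its keys, the rule identifier, and the use of plain Python tuples instead of RDF terms are all hypothetical simplifications; the actual ontology and its restrictions are not reproduced here.

```python
# Hypothetical excerpt of predicate restrictions (range and maximum cardinality)
ONTOLOGY = {
    "hasMinAge": {"range": int, "max_cardinality": 1},
    "hasApplicableService": {"range": str, "max_cardinality": None},
}

def validate(predicate, values):
    """Check condition values against the restrictions of the predicate."""
    restriction = ONTOLOGY.get(predicate)
    if restriction is None:
        return False  # predicate not defined in the ontology excerpt
    limit = restriction["max_cardinality"]
    if limit is not None and len(values) > limit:
        return False  # too many values for this predicate
    return all(isinstance(v, restriction["range"]) for v in values)

def to_triples(rule_id, conditions):
    """Build (subject, predicate, object) triples, rejecting invalid conditions."""
    triples = []
    for predicate, values in conditions.items():
        if not validate(predicate, values):
            raise ValueError(f"invalid values for {predicate}: {values}")
        triples.extend((rule_id, predicate, value) for value in values)
    return triples

triples = to_triples(
    "rule:R14",
    {"hasApplicableService": ["complete dentures", "partial dentures"]},
)
print(triples)
```

Each value becomes a leaf attached to its condition, which mirrors the tree view described above; a real implementation would use IRIs and typed literals rather than bare strings.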
We build upon recent natural language processing (NLP) techniques 24,25 to automatically identify dependencies between relevant entities and relations described in a fragment of policy text, and to assemble them into a rule. Clais uses a configurable NLP extraction pipeline, where each component can be replaced or complemented by others with similar functionalities. The configuration can be customized either manually or using hyper-parameter optimization 53,54 to tune the overall performance of the extraction pipeline for a given policy, domain or geographic region (specifically, we use the Optuna 55 hyperparameter optimization framework). The Clais NLP extraction pipeline (Fig. 6), which does not require labelled data, consists of the following steps: (1) data preparation according to the policy domain and geography (state/region); (2) automatic annotation of policy text fragments to identify mentions of domain entities and relations in the text; (3) building of rule knowledge graphs corresponding to policy text segments using their domain entities and relations in accordance with the ontology definitions; (4) knowledge graph consolidation and filtering to produce a set of well-formed rules (necessary
https://doi.org/10.1038/s41598-024-62665-0

Figure 2. (a) The distributions of values (box plots) and the empirical cumulative distribution functions of the rule similarity metrics in the evaluation using the ground truth rules from the Iowa policy document; rows show results for the three rule similarity metrics (structure, conditions, and overall rule similarity) with two configurations of our system (optimized for F1 and optimized for rule similarity). Subfigures (b,c) show data from the evaluation using the ground truth rules from the Colorado policy document (b), and using all ground truth rules (c). We observe that optimizing for rule similarity does not always have significant effects on improving the rule similarity metrics. In the evaluation with the ground truth rules from the Iowa policy document (a), the structure similarity and the overall rule similarity obtained when optimizing for rule similarity are significantly greater than the corresponding metrics obtained when optimizing for F1 (the p-value of the one-tailed Mann-Whitney test is 0.01824 for the structure similarity and 0.04877 for the overall similarity). In the evaluation with the ground truth rules from the Colorado policy document (b), the optimization for rule similarity has no statistically significant effects. Considering all ground truth rules (c), the optimization for rule similarity shows statistically significant effects only for the structure similarity metric (the p-value of the one-tailed Mann-Whitney test is 0.03958).

Figure 3. (a) Comparison of precision, recall and F1-score for identification of non-compliant claims using Clais and two baseline models (gradient boosting and deep neural network). The baseline models use a subset of the labelled claims data for training, and therefore the evaluation of the three systems is done using a disjoint test-subset of the labelled data. When evaluating Clais we use the automatically extracted rules (as opposed to the same rules curated by human experts); our system does not automatically extract rules R7 and R15. We also report quartiles of precision, recall and F1-score. The results of the one-tailed Mann-Whitney test indicate that Clais metrics significantly outperform the baseline models. (b) Clais user interface for analysing the results produced by the execution of rules on claims data.

Figure 4. Frequency of answers to (a) the perceived usefulness section of the PUEU questionnaire, and (b) the usefulness section of the USE questionnaire. The heatmaps show frequencies of responses for different groups of participants. Each heatmap shows the frequency of responses (Likert scores from 1 to 7; only values larger than 0 are reported); at the right of each heatmap we aggregate the overall frequency of positive responses (Likert scores 5, 6, and 7); similarly, at the left of each heatmap we aggregate the overall frequency of negative responses (Likert scores 1, 2, and 3).

Figure 5. Analysis of the pairwise similarity of the ground truth rules, and the ground truth algorithms. (a-d) and (f-i) are hierarchically clustered heatmaps of the text similarity, structure similarity, average condition similarity and overall rule similarity for ground truth rules and ground truth algorithms, respectively. Red markers in (a-d) identify ground truth algorithms. The dendrograms show how rules are clustered: we colour the leaves of the dendrograms to show how the clustering algorithm groups rules according to their type (mutual exclusion, non-coverage, and service limitation); for these plots we used the seaborn 43 clustermap function with default method (average) and default metric (Euclidean) to compute the hierarchical clusters. The numbers along the horizontal and vertical axes of (f-i) identify the rules corresponding to the ground truth algorithms. The bivariate histograms (e,j) show the distributions of the values of the rule similarity and text similarity metrics for every distinct pair $(R_x, R_y)$ with $x \neq y$: (e) shows the distributions for ground truth rules, and (j) for ground truth algorithms. The similarity metrics are symmetric, and therefore we consider only distinct pairs: if we consider $(R_x, R_y)$ then we omit $(R_y, R_x)$. The bivariate histograms show the distribution of the values of two similarity metrics by tiling the data space [0, 1] × [0, 1] with 2500 bins. The color of the bivariate histograms shows the percentage of observations in each bin. The marginal histograms show the distributions of the text similarity (horizontal marginal) and the rule similarity (vertical marginal); both marginal histograms tile the respective data space [0, 1] with 50 bins, and the height of each bar is proportional to the percentage of observations in each bin. We used the seaborn 43 jointplot function to plot the bivariate histograms with marginals. (k) compares in more detail two ground truth algorithms (R1 and R2); this is a typical example of two rules having high text similarity, but low rule similarity.

Table 2. Percentage of participants who answered positively (Likert score 5, 6, or 7) to all questions related to usefulness.